[SOUND]
So average precision is computed for
just one query.
But we generally experiment with many
different queries and this is to
avoid the variance across queries.
Depending on the queries you use you
might make different conclusions.
Right, so
it's better to use more queries.
If you use more queries,
then you will also have to
take the average of the average
precision over all these queries.
So how can we do that?
Well, you can naturally
think of just taking the arithmetic mean,
as we always tend to think in this way.
So this would give us what's called
"Mean Average Precision", or MAP.
In this case,
we take the arithmetic mean of all the average
precisions over several queries or topics.
But as I mentioned in
another lecture, is this good?
Recall that
we talked about the different ways
of combining precision and recall,
and we concluded that the arithmetic
mean is not as good as the F measure.
Here it's the same.
We can also think about alternative
ways of aggregating the numbers.
Don't just automatically assume that
we should take the arithmetic
mean of the average precision over
these queries.
Let's think about what's
the best way of aggregating them.
If you think about the different ways,
naturally you will
probably be able to think of
another way, which is the geometric mean.
And we call this kind of average gMAP.
This is another way.
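To make the two averages concrete, here is a minimal sketch in Python. The function names, the `eps` floor, and the average precision values are assumptions made up for illustration; they are not from the lecture.

```python
import math

def mean_average_precision(ap_values):
    """Arithmetic mean of per-query average precision (MAP)."""
    return sum(ap_values) / len(ap_values)

def geometric_mean_average_precision(ap_values, eps=1e-5):
    """Geometric mean of per-query average precision (gMAP).

    eps is an assumed small floor so that a query with average
    precision 0 does not make log() fail.
    """
    logs = [math.log(max(ap, eps)) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))

# Hypothetical average precision values for three queries.
ap_values = [0.8, 0.5, 0.2]

map_score = mean_average_precision(ap_values)
gmap_score = geometric_mean_average_precision(ap_values)
print(f"MAP = {map_score:.4f}, gMAP = {gmap_score:.4f}")
```

For positive values that are not all equal, the geometric mean is always below the arithmetic mean, so gMAP comes out smaller than MAP here.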
So now, once you think about
the two different ways
of doing the same thing,
the natural question to ask is,
which one is better?
So, do you use MAP or gMAP?
Again, that's an important question.
Imagine you are again
testing a new algorithm
by comparing it with your old
algorithm in the search engine.
Now you have tested multiple topics.
Now you've got the average precision for
these topics.
Now you are thinking of looking
at the overall performance.
You have to take the average.
But which strategy would you use?
Now first, you should also think about the
question, does it make a difference?
Can you think of scenarios where using
one of them would make a difference?
That is, they would give different
rankings of those methods.
And that also means that, depending on
the way you take the
average of these average precisions,
you will get different conclusions.
This makes the question
even more important.
Right?
So, which one would you use?
Well again, if you look at
the difference between these
different ways of aggregating
the average precision,
you'll realize that in the arithmetic mean,
the sum is dominated by large values.
So what does a large value here mean?
It means the query is relatively easy.
You can have a high
average precision.
Whereas gMAP tends to be
affected more by low values.
And those are the queries that
don't have good performance.
The average precision is low.
So if you think about
improving the search engine for
those difficult queries,
then gMAP would be preferred, right?
On the other hand, if you just want to
improve
over all kinds of queries, or
particularly popular queries that might be
easy and that you want to make perfect,
then maybe MAP would be preferred.
So again, the answer depends on
your users, your users' tasks, and
their preferences.
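A small sketch of a scenario where the two averages give different rankings of two systems. The per-query average precision values and the system names are made up for illustration. `system_a` does very well on two easy queries and very badly on a hard one, while `system_b` is more even.

```python
from statistics import mean, geometric_mean

# Hypothetical per-query average precision for two systems
# on the same three queries (numbers are invented).
system_a = [0.90, 0.90, 0.02]  # strong on easy queries, fails the hard one
system_b = [0.70, 0.70, 0.20]  # weaker on easy queries, better on the hard one

map_a, map_b = mean(system_a), mean(system_b)
gmap_a, gmap_b = geometric_mean(system_a), geometric_mean(system_b)

print(f"MAP : A = {map_a:.3f}, B = {map_b:.3f}")
print(f"gMAP: A = {gmap_a:.3f}, B = {gmap_b:.3f}")

# MAP is dominated by the large values, so it prefers system A.
# gMAP is pulled down by the low value, so it prefers system B.
print("MAP prefers", "A" if map_a > map_b else "B")
print("gMAP prefers", "A" if gmap_a > gmap_b else "B")
```

With these numbers MAP ranks A above B while gMAP ranks B above A, which is exactly the kind of disagreement that makes the choice of average matter.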
So the point here is to think
about the multiple ways to solve
the same problem, and then compare them,
and think carefully about the differences
and which one makes more sense.
Often, one of them might
make sense in one situation and
another might make more sense
in a different situation.
So it's important to figure out under
what situations one is preferred.
As a special case of the mean average
precision, we can also think about
the case where there is precisely
one relevant document.
And this happens often, for example,
in what's called a known-item search,
where you know a target page. Let's
say you have to find the Amazon homepage.
You have one relevant document there,
and you hope to find it.
That's called a "known-item search".
In that case,
there's precisely one relevant document.
Or in another application,
like question answering,
maybe there's only one answer
there.
So if you rank the answers,
then your goal is to rank that one
particular answer on top, right?
So in this case, you can easily
verify that the average precision
will basically boil down
to the reciprocal rank.
That is, 1 over r, where r is the rank
position of that single relevant document.
So if that document is ranked
on the very top, r is 1, and
then the reciprocal rank is 1.
If it's ranked
second, then it's 1 over 2.
Et cetera.
And then we can also take an average
of all these average precisions, or
reciprocal ranks, over a set of topics, and
that would give us something
called the mean reciprocal rank.
It's a very popular measure
for known-item search or, you know,
any problem where you have
just one relevant item.
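As a minimal sketch, the following Python checks that average precision reduces to 1 over r when there is exactly one relevant document, and then averages the reciprocal ranks over topics. The function names and the toy ranked lists are assumptions for illustration, and `average_precision` here assumes every relevant document appears somewhere in the ranked list.

```python
def average_precision(relevance):
    """Average precision for one ranked list of 0/1 relevance flags.

    Assumes all relevant documents appear in the list, so the number
    of hits equals the total number of relevant documents.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0

def reciprocal_rank(relevance):
    """1/r, where r is the rank of the first relevant document (0 if none)."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(ranked_lists):
    """Average of the reciprocal rank over a set of topics."""
    return sum(reciprocal_rank(r) for r in ranked_lists) / len(ranked_lists)

# Hypothetical topics, each with exactly one relevant document.
topics = [
    [1, 0, 0, 0],  # relevant document at rank 1
    [0, 1, 0, 0],  # relevant document at rank 2
    [0, 0, 0, 1],  # relevant document at rank 4
]

for t in topics:
    print(average_precision(t), reciprocal_rank(t))

mrr = mean_reciprocal_rank(topics)
print(f"MRR = {mrr:.4f}")
```

For each of these single-answer topics the two per-topic numbers printed are identical, so in this setting MAP and mean reciprocal rank are the same measure.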
Now again here, you can see this
r actually is meaningful here.
And this r is basically
indicating how much effort
a user would have to make in order
to find that relevant document.
If it's ranked on the top, it's
little effort that you have to make.
But if it's ranked at 100,
then you actually have to
read presumably 100 documents
in order to find it.
So, in this sense r is also a meaningful
measure and the reciprocal rank will
take the reciprocal of r,
instead of using r directly.
So a natural question here
is, why not simply use r?
Imagine if you were to design
a measure of the performance
of a ranking system
when there is only one relevant item.
You might have thought about
using r directly as the measure.
After all,
that measures the user's effort, right?
But think about what happens if you take an average
of this over a large number of topics.
Again it would make a difference.
Right, for one single topic, using r or
using 1 over r wouldn't
make any difference.
It's the same.
A larger r corresponds
to a smaller 1 over r, right?
The difference would only
show up when you have many topics.
So again, think about the average of the
reciprocal rank versus the average of just r.
What's the difference?
Do you see any difference?
And would this difference
change the order of systems
in our conclusion?
It turns out that
there is actually a big difference, and
if you want to
think about it yourself,
then pause the video.
Basically, the difference is,
if you take the sum of r directly, then
again it will be dominated
by large values of r.
So what are those values?
Those are basically large values that
indicate lower-ranked results.
That means the relevant items are
ranked very low down on the list.
And the sum, and therefore also the average,
would then be dominated by
the topics where those relevant documents
are ranked
in the lower portion of the ranked list.
But from a user's perspective we care
more about the highly ranked documents.
So by taking this transformation,
by using the reciprocal rank,
we put more emphasis on
the differences at the top.
You know, think about
the difference between 1 and 2:
it would make a big difference in 1 over
r. But think about 100 and 101:
that won't make much
difference if you use 1 over r.
But if you use r directly, there will
be a big difference between 100 and,
let's say, 1,000, right?
So this is not desirable.
On the other hand, 1 and
2 won't make much difference.
So this is yet another case where there
may be multiple choices of doing the same
thing and then you need to figure
out which one makes more sense.
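A small sketch of how averaging r directly and averaging 1 over r can order two systems differently. The rank positions and system names below are made up for illustration: `system_a_ranks` puts the answer on top for three topics but at rank 100 for one, while `system_b_ranks` always puts it at rank 20.

```python
def mean_rank(ranks):
    """Average of r over topics (lower is better)."""
    return sum(ranks) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/r over topics (higher is better)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical rank of the single relevant item for four topics.
system_a_ranks = [1, 1, 1, 100]
system_b_ranks = [20, 20, 20, 20]

print("mean r:", mean_rank(system_a_ranks), mean_rank(system_b_ranks))
print("MRR   :", mean_reciprocal_rank(system_a_ranks),
      mean_reciprocal_rank(system_b_ranks))

# The average of r is dominated by the one large value (100),
# so it favors system B. The average of 1/r rewards the three
# top-ranked answers, so it favors system A.
```

Under the mean of r, system B looks better, although a user would find the answer immediately with system A on three of the four topics. The mean reciprocal rank reflects that by preferring system A.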
So to summarize,
we showed that the precision-recall curve
can characterize the overall
accuracy of a ranked list.
And we emphasized that the actual
utility of a ranked list depends
on how many top-ranked results
a user would actually examine.
Some users will examine more
than others.
Average precision is a standard measure
for comparing two ranking methods.
It combines precision and recall, and
it's sensitive to the rank
of every relevant document.
[MUSIC]

